Assignment 2¶

In [2]:
library(tidyverse)
library(repr)
library(infer)
library(cowplot)
library(broom)
library(GGally)
library(AER)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘cowplot’


The following object is masked from ‘package:lubridate’:

    stamp


Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Loading required package: car

Loading required package: carData


Attaching package: ‘car’


The following object is masked from ‘package:dplyr’:

    recode


The following object is masked from ‘package:purrr’:

    some


Loading required package: lmtest

Loading required package: zoo


Attaching package: ‘zoo’


The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric


Loading required package: sandwich

Loading required package: survival

Demonstrate that the dataset can be read from the web into R.

In [3]:
data <- read.csv(url("https://raw.githubusercontent.com/SiyingLiu03/GroupProjectSTAT301/main/HR-Employee-Attrition.csv"))
head(data)
A data.frame: 6 × 35
AgeAttritionBusinessTravelDailyRateDepartmentDistanceFromHomeEducationEducationFieldEmployeeCountEmployeeNumber⋯RelationshipSatisfactionStandardHoursStockOptionLevelTotalWorkingYearsTrainingTimesLastYearWorkLifeBalanceYearsAtCompanyYearsInCurrentRoleYearsSinceLastPromotionYearsWithCurrManager
<int><chr><chr><int><chr><int><int><chr><int><int>⋯<int><int><int><int><int><int><int><int><int><int>
141YesTravel_Rarely 1102Sales 12Life Sciences11⋯1800 801 6405
249No Travel_Frequently 279Research & Development81Life Sciences12⋯4801103310717
337YesTravel_Rarely 1373Research & Development22Other 14⋯2800 733 0000
433No Travel_Frequently1392Research & Development34Life Sciences15⋯3800 833 8730
527No Travel_Rarely 591Research & Development21Medical 17⋯4801 633 2222
632No Travel_Frequently1005Research & Development22Life Sciences18⋯3800 822 7736

The dataset contains the information about every employees, such as their personal information and information about their status in the company. The focus of dataset is the attrition status and can be used to help HR to decide to retain this employee or not. There are 1470 observations in the dataset and 35 variables. Below is the description of dataset。

Variable Type Description
Age continuous Age of Employee
Attrition categorical If the employee has been quited or some other reason
BusinessTravel categorical Rate of business travel
DailyRate continuous Amount of pay per day
Department categorical Department of employee
DistanceFromHome continuous Distance from home to work in miles
Education categorical Education level
EducationField categorical Field of study
EmployeeCount continuous Count of employee
EmployeeNumber categorical Employee number
EnvironmentSatisfaction categorical Employee satisfaction(from 1 to 4) at work
Gender categorical gender of employee
HourlyRate continuous employee hourly rate
JobInvolvement categorical Level of involvment
JobLevel categorical Level of senority
JobRole categorical Employee's role in the company
JobSatisfaction categorical Satisfaction of employee
MaritalStatus categorical Marital Status of employee
MonthlyIncome continuous Monthly income of an employee
MonthlyRate continuous Monthly rate of an employee
NumCompaniesWorked continuous Number of companies that employee has worked at before
Over18 categorical Employee is over 18 or not
OverTime categorical Whether the employee works overtime
PercentSalaryHike continuous Percent of salary hike
PerformanceRating categorical performance rate
RelationshipSatisfaction categorical satisfaction of relationship
StandardHours continuous per week standard work hours
StockOptionLevel categorical company stock option level
TotalWorkingYears continuous total working years
TrainingTimesLastYear continuous training time
WorkLifeBalance categorical balance between work and life
YearAtCompany continuous total years at current company
YearsInCurrentRole continuous total years in current role
YearsSinceLastPromotion continuous number of years since last promotion
YearsWithCurrManager continuous Years working under the current manager

Question: How is attrition (as our response variable) affected by the explanatary variables: Age,Education,EducationField, DailyRate, Gender, OverTime, PerformanceRating, YearsAtCompany, YearsSinceLastPromotion The question will be focused on prediction (select a model that can predict the attrition status).

Data Cleaning¶

Clean and wrangle the data into a tidy format.

In [39]:
# Convert the categorical variables in the dataset into factors
data$Attrition <- factor(data$Attrition)
data$BusinessTravel <- factor(data$BusinessTravel)
data$Department <- factor(data$Department)
data$EducationField <- factor(data$EducationField)
data$EmployeeCount <- factor(data$EmployeeCount)
data$EnvironmentSatisfaction <- factor(data$EnvironmentSatisfaction)
data$Gender <- factor(data$Gender)
data$Education <- factor(data$Education)
data$JobInvolvement <- factor(data$JobInvolvement)
data$JobLevel <- factor(data$JobLevel)
data$PerformanceRating <- factor(data$PerformanceRating)
data$JobRole <- factor(data$JobRole)
data$JobSatisfaction <- factor(data$JobSatisfaction)
data$MaritalStatus <- factor(data$MaritalStatus)
data$Over18 <- factor(data$Over18)
data$OverTime <- factor(data$OverTime)
data$RelationshipSatisfaction <- factor(data$RelationshipSatisfaction)
data$StockOptionLevel <- factor(data$StockOptionLevel)
data$WorkLifeBalance <- factor(data$WorkLifeBalance)

head(data)
A data.frame: 6 × 35
AgeAttritionBusinessTravelDailyRateDepartmentDistanceFromHomeEducationEducationFieldEmployeeCountEmployeeNumber⋯RelationshipSatisfactionStandardHoursStockOptionLevelTotalWorkingYearsTrainingTimesLastYearWorkLifeBalanceYearsAtCompanyYearsInCurrentRoleYearsSinceLastPromotionYearsWithCurrManager
<int><fct><fct><int><fct><int><fct><fct><fct><int>⋯<fct><int><fct><int><int><fct><int><int><int><int>
141YesTravel_Rarely 1102Sales 12Life Sciences11⋯1800 801 6405
249No Travel_Frequently 279Research & Development81Life Sciences12⋯4801103310717
337YesTravel_Rarely 1373Research & Development22Other 14⋯2800 733 0000
433No Travel_Frequently1392Research & Development34Life Sciences15⋯3800 833 8730
527No Travel_Rarely 591Research & Development21Medical 17⋯4801 633 2222
632No Travel_Frequently1005Research & Development22Life Sciences18⋯3800 822 7736
In [40]:
#Remove rows with missing values
hrdata <- na.omit(data)
In [41]:
#Get the number of observations we have
n <- nrow(hrdata)
n
1470
In [42]:
# Get column names
column_names <- colnames(hrdata)
column_names
  1. 'Age'
  2. 'Attrition'
  3. 'BusinessTravel'
  4. 'DailyRate'
  5. 'Department'
  6. 'DistanceFromHome'
  7. 'Education'
  8. 'EducationField'
  9. 'EmployeeCount'
  10. 'EmployeeNumber'
  11. 'EnvironmentSatisfaction'
  12. 'Gender'
  13. 'HourlyRate'
  14. 'JobInvolvement'
  15. 'JobLevel'
  16. 'JobRole'
  17. 'JobSatisfaction'
  18. 'MaritalStatus'
  19. 'MonthlyIncome'
  20. 'MonthlyRate'
  21. 'NumCompaniesWorked'
  22. 'Over18'
  23. 'OverTime'
  24. 'PercentSalaryHike'
  25. 'PerformanceRating'
  26. 'RelationshipSatisfaction'
  27. 'StandardHours'
  28. 'StockOptionLevel'
  29. 'TotalWorkingYears'
  30. 'TrainingTimesLastYear'
  31. 'WorkLifeBalance'
  32. 'YearsAtCompany'
  33. 'YearsInCurrentRole'
  34. 'YearsSinceLastPromotion'
  35. 'YearsWithCurrManager'

Then, select the variables we are interested and change the names of columns.

In [43]:
hrdata <- hrdata |>
        select(Age,Attrition,Education,EducationField, DailyRate,
              Gender, OverTime, 
               PerformanceRating, YearsAtCompany, YearsSinceLastPromotion)
hrdata <- hrdata %>%
  rename(age = Age,
         attrition = Attrition,
         edu = Education,
         edufield = EducationField,
         daily = DailyRate,
         gender = Gender,
         overtime = OverTime,
         perrating = PerformanceRating,
         yratcom = YearsAtCompany,
         yrsincepromotion = YearsSinceLastPromotion
        )
head(hrdata)
A data.frame: 6 × 10
ageattritioneduedufielddailygenderovertimeperratingyratcomyrsincepromotion
<int><fct><fct><fct><int><fct><fct><fct><int><int>
141Yes2Life Sciences1102FemaleYes3 60
249No 1Life Sciences 279Male No 4101
337Yes2Other 1373Male Yes3 00
433No 4Life Sciences1392FemaleYes3 83
527No 1Medical 591Male No 3 22
632No 2Life Sciences1005Male No 3 73

Visualization¶

Propose a visualization that I consider relevant to address my question or to explore the data.

After we have a tidy data, we can visualize the dataset between variables.

We need to see that which variables are relevant and detect the potential problems in the model.

First, we see the variables that are potentially associated with the response variable (attrition).

To see if there is a difference or any correlation between age and attrition status, we make a boxplot to view.

In [8]:
#Age vs Attrition
age_attrition_boxplots <- hrdata %>%
   ggplot() +
   geom_boxplot(aes(x = attrition, y = age)) +
   theme(
     text = element_text(size = 22),
     plot.title = element_text(face = "bold"),
     axis.title = element_text(face = "bold")
   ) +
   ggtitle("Boxplot of Age vs Attrition") +
   xlab("Attrtion") +
   ylab("Age of employees")
age_attrition_boxplots
No description has been provided for this image

The median of age of people with attrition is smaller than the people without attrition which may indicate that the company intends to keep their employees. And these two attrition status have similar range.

To view the count of each gender in people with attrition and without attrition

In [24]:
age_attrition_plot <- ggplot(hrdata, aes(x = attrition, fill = gender)) +
  geom_bar(position = 'dodge') +
  labs(title = 'Plot of Attrition and Gender', x = 'Attrition', y = 'Count') +
  theme_minimal()

age_attrition_plot
No description has been provided for this image

In both attrition status, male employees are more than female employees.

To view what the proportion of education level is in each attrition status

In [33]:
edu_attrition_plot <- ggplot(hrdata, aes(x = attrition, fill = edu)) +
  geom_bar(position = 'dodge') +
  labs(title = 'Plot of Attrition and Education level', x = 'Attrition', y = 'Count') +
  theme_minimal()

edu_attrition_plot
No description has been provided for this image

It can be told from the plot that the company perfers employee with education level 3. And both graphs have similar distribution.

To view what the proportion of education field is in each attrition status

In [34]:
edufield_attrition_plot <- ggplot(hrdata, aes(x = attrition, fill = edufield)) +
  geom_bar(position = 'dodge') +
  labs(title = 'Plot of Attrition and Education Field', x = 'Attrition', y = 'Count') +
  theme_minimal()

edufield_attrition_plot
No description has been provided for this image

The company perfers to keep employees that are in life science field and marketing area.

To visualize the overTime variable and attrition status.

In [36]:
overTime_attrition_plot <- ggplot(hrdata, aes(x = attrition, fill = overtime)) +
  geom_bar(position = 'dodge') +
  labs(title = 'Plot of Attrition and Overtime', x = 'Attrition', y = 'Count') +
  theme_minimal()

overTime_attrition_plot
No description has been provided for this image

It can be seen that employees with attrition have more overtime than the people with no attrition, which maybe is a reason that they leave the company.

To view the daily rate and attrition status, we make a boxplot.

In [10]:
daily_attrition_boxplots <- hrdata %>%
   ggplot() +
   geom_boxplot(aes(x = attrition, y = daily)) +
   theme(
     text = element_text(size = 22),
     plot.title = element_text(face = "bold"),
     axis.title = element_text(face = "bold")
   ) +
   ggtitle("Boxplot of Daily rate vs Attrition") +
   xlab("Attrtion") +
   ylab("Daily rate of employees")
daily_attrition_boxplots
No description has been provided for this image

The two groups of attrition have similar daily rate.

To view the work years in company variable and attrition status, we make a boxplot to see if there is a difference between the people with attrition and without it.

In [38]:
yratcom_attrition_boxplots <- hrdata %>%
   ggplot() +
   geom_boxplot(aes(x = attrition, y = yratcom)) +
   theme(
     text = element_text(size = 22),
     plot.title = element_text(face = "bold"),
     axis.title = element_text(face = "bold")
   ) +
   ggtitle("Boxplot of years at company vs Attrition") +
   xlab("Attrtion") +
   ylab("years of employees at company")
yratcom_attrition_boxplots
No description has been provided for this image

From the boxplot, it can be seen that the company would like to keep the employee in a long term and whenever they realise this employee is not suitable for the company, the employee has to leave the company, which I think is a reason of low median of yes attrition.

Visualize the performance rating variable versus attrition

In [44]:
perrating_attrition_plot <- ggplot(hrdata, aes(x = attrition, fill = perrating)) +
  geom_bar(position = 'dodge') +
  labs(title = 'Plot of Attrition and performance ', x = 'Attrition', y = 'Count') +
  theme_minimal()

perrating_attrition_plot
No description has been provided for this image

Visualize the year since last promotion variable versus attrition

In [47]:
yrsincepromotion_attrition_boxplots <- hrdata %>%
   ggplot() +
   geom_boxplot(aes(x = attrition, y = yrsincepromotion)) +
   theme(
     text = element_text(size = 22),
     plot.title = element_text(face = "bold"),
     axis.title = element_text(face = "bold")
   ) +
   ggtitle("Boxplot of year since promotion vs Attrition") +
   xlab("Attrtion") +
   ylab("years since last promotion")
yrsincepromotion_attrition_boxplots
No description has been provided for this image

Two groups have similar numeber of years since last promotion.

To see is there any correlation between work years variable and years in current role variable, we make a scatter plot filled by attrition. Based on this, we can know that how the work years in the company and years in current role affect the attrition status, such as an employee leaves the company because they have worked in the company for a long time but still in a same position for a long time.

In [53]:
ggplot(hrdata, aes(x = yratcom, y = yrsincepromotion, fill = attrition)) +
  geom_point(shape = 21, size = 3) +  # Use shape 21 for filled circles
  labs(title = "years at company vs years since last promotion", x = "years at company", y = "years since last promotion")
No description has been provided for this image

Since we need to consider about multicollinearity, I make a correlation ggpair graph and a correlation matrix to visualize and analyse it.

Explore the correlation between numerical variables.

In [54]:
options(repr.plot.width = 15, repr.plot.height = 12)
num_hrdata <- hrdata |> select(age,attrition,daily,yratcom,yrsincepromotion)

hrdata_pair_plots <- num_hrdata %>%
   select(- attrition) %>% 
   ggpairs(progress = FALSE) +
   theme(
     text = element_text(size = 15),
     plot.title = element_text(face = "bold"),
     axis.title = element_text(face = "bold")
   )
 hrdata_pair_plots

corr_matrix_HRdata <- num_hrdata %>%
   select(- attrition) %>% 
   cor() %>%
   as.data.frame() %>%
   rownames_to_column("var1") %>%
   pivot_longer(-var1, names_to = "var2", values_to = "corr")
 head(corr_matrix_HRdata)
A tibble: 6 × 3
var1var2corr
<chr><chr><dbl>
age age 1.00000000
age daily 0.01066094
age yratcom 0.31130877
age yrsincepromotion0.21651337
dailyage 0.01066094
dailydaily 1.00000000
No description has been provided for this image
In [56]:
# Based on the given table, we conduct a correlation matrix.
options(repr.plot.width = 15, repr.plot.height = 10)

 plot_corr_matrix_HRdata<- corr_matrix_HRdata %>%
   ggplot(aes(x = var1, y = var2)) +
   geom_tile(aes(fill = corr), color = "white") +
   scale_fill_distiller("Correlation Coefficient \n",
     palette =  "YlOrRd",
     direction = 1, limits = c(-1,1)
   ) +
   labs(x = "Variable 1", y = "Variable 2") +
   theme_minimal() +
   theme(
     axis.text.x = element_text(
       angle = 45, vjust = 1,
       size = 18, hjust = 1
     ),
     axis.text.y = element_text(
       vjust = 1,
       size = 18, hjust = 1
     ),
     legend.title = element_text(size = 18, face = "bold"),
     legend.text = element_text(size = 18),
     legend.key.size = unit(2, "cm")
   ) +
   coord_fixed() +
   geom_text(aes(x = var1, y = var2, label = round(corr, 2)), color = "black", size = 6)
 plot_corr_matrix_HRdata
No description has been provided for this image

As you can see, there are several variables that have high correlation or multicollinearity which will be considered in the next step of this project. It is also necessary to find out is there any correlation between dependent variable and independent variable, in order to determine which the confounders are.

In [ ]: